Fix: Make the janitor best effort#5802
Conversation
Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>
| num_expired_snapshots += len(batch.expired_snapshot_ids) | ||
| except Exception as e: | ||
| message = f"Failed to clean up an expired snapshots batch: {e}" | ||
| logger.warning(message) |
| self.state_sync.delete_expired_environments(current_ts=current_ts) | ||
| # we want to retry on the next janitor pass if drops failed | ||
| if not failures: | ||
| self.state_sync.delete_expired_environments(current_ts=current_ts) |
There was a problem hiding this comment.
In the case where a schema was permanently deleted by the user, prior to the janitor running, won't this cause the environment to never expire? I feel like we need an escape hatch.
There was a problem hiding this comment.
so this pr unblocks the janitor so it doesn't accumulate stale snapshots in the case of one failure and it can continue with the rest of the operations
this guard to not delete the expired environments in this case of a failure is matching the current behaviour we have of retrying the next time that the janitor runs rather than deleting the expired environments and leaving orphans in the db. the test: test_janitor_cleanup_order is an example of this behaviour that we currently want
I see the value of an escape hatch for the cases where drops persistently fail, but this would require a more deliberate mechanism. I can't think of a simple way to do it as part of this pr without changing the janitor design substantially unless you have a suggestion of a simple way
Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>
Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Updates SQLMesh’s janitor workflow to be “best effort”: it attempts all cleanup operations, collects failures, and surfaces them together at the end (with optional warning-only behavior and a --force-delete mode to purge state even when physical drops fail).
Changes:
- Make environment view/schema/catalog cleanup return a list of failures instead of raising immediately.
- Make expired snapshot deletion best-effort, optionally deleting snapshot state even when physical cleanup fails (
force_delete). - Aggregate and surface janitor failures in
Context._run_janitor, add CLI plumbing and expand test coverage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/core/test_snapshot_evaluator.py | Updates expectations for snapshot cleanup failures now surfaced as SQLMeshError. |
| tests/core/test_janitor.py | Adjusts janitor unit tests for best-effort cleanup and failure collection. |
| tests/core/integration/test_aux_commands.py | Adds/updates integration tests for aggregated failures, warn-only mode, and force_delete. |
| sqlmesh/core/snapshot/evaluator.py | Switches snapshot cleanup to collect concurrent failures and raise a combined error. |
| sqlmesh/core/janitor.py | Returns failure lists for cleanup helpers; adds force_delete support for snapshot state deletion on failure. |
| sqlmesh/core/context.py | Aggregates janitor failures and applies warn/raise behavior; adds force_delete parameter plumbing. |
| sqlmesh/cli/main.py | Adds --force-delete option and passes it through to Context.run_janitor. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| failure_string = "\n - ".join(failures) | ||
| summary = f"Janitor completed with failures:\n {failure_string}" |
| if errors: | ||
| errored_snapshots = "\n".join(f" {e.node.name}: {e.__cause__}" for e in errors) | ||
| raise SQLMeshError(f"\n{errored_snapshots}") |
| message = f"Failed to drop the expired environment view '{expired_view}': {e}" | ||
| if warn_on_delete_failure: | ||
| logger.warning(message) | ||
| else: | ||
| raise SQLMeshError(message) from e | ||
| logger.warning(message) | ||
| failures.append(message) |
| except Exception as e: | ||
| message = f"Failed to drop the expired environment schema '{schema}': {e}" | ||
| if warn_on_delete_failure: | ||
| logger.warning(message) | ||
| else: | ||
| raise SQLMeshError(message) from e | ||
| logger.warning(message) | ||
| failures.append(message) |
| except Exception as e: | ||
| message = f"Failed to drop the expired environment catalog '{catalog}': {e}" | ||
| if warn_on_delete_failure: | ||
| logger.warning(message) | ||
| else: | ||
| raise SQLMeshError(message) from e | ||
| logger.warning(message) | ||
| failures.append(message) |
This updates the janitor instead of stopping on the first drop failure, to attempt every cleanup operation and aggregates all failures into a single error raised at the end. The
warn_on_delete_failureis also respected so in that case it is set to true it would warn rather than raise.It also introduces the force delete option. If the janitor fails to drop a view, schema or snapshot table it retains the corresponding state records so the next run can retry. This is correct for transient failures, but if the failure is permanent the janitor will retry forever and the affected environments and snapshots will never be cleaned up.
In that case, with the
--force-deleteflag:purges the state records for expired environments and snapshots even when their physical objects could not be dropped. But any objects that were not successfully dropped becomes orphaned and must be removed manually from the data warehouse.